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Abstract 

The study of paleopolyploidies requires the comparison of multiple whole genome sequences. If the branches of a phylogeny 
on which a whole-genome duplication (WGD) occurred could be identified before genome sequencing, taxa could be 
selected that provided a better assessment of that genome duplication. Here, we describe a likelihood model in which the 
number of chromosomes in a genome evolves according to a Markov process with one rate of chromosome duplication and 
loss that is proportional to the number of chromosomes in the genome and another stochastic rate at which every 
chromosome in the genome could duplicate in a single event. We compare the maximum likelihoods of a model in which the 
genome duplication rate varies to one in which it is fixed at zero using the Akaike information criterion, to determine if 
a model with WGDs is a good fit for the data. Once it has been determined that the data does fit the WGD model, we infer 
the phylogenetic position of paleopolyploidies by calculating the posterior probability that a WGD occurred on each branch 
of the taxon tree. Here, we apply this model to a molluscan tree represented by 124 taxa and infer three putative WGD 
events. In the Gastropoda, we identify a single branch within the Hypsogastropoda and one of two branches at the base of 
the Stylommatophora. We also identify one or two branches near the base of the Cephalopoda. 
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Introduction 

Polyploidy has long been recognized as an important mech- 
anism of genetic evolution (Taylor and Raes 2004), and over 
the last decade, as full genome sequences have become 
available, a great deal of research has been invested into 
the analysis of whole-genome duplications (WGDs) (e.g., 
Byrne and Blanc 2006; Semon and Wolfe 2007). Polyploid 
species and individuals are common in plants (Wendel 
2000); evidence is accumulating that WGDs also occur in 
the opisthokonts and have played an important role in 
the evolution of their genomes (McLysaght et al. 2002; 
Vandepoele et al. 2004; Wolfe 2004). In contrast to other 
modes of genome evolution, polyploidy events affect the 
entire genome at once. By duplicating every gene in the 
genome, a large amount of redundant genetic information 
is created, which can be used as raw material for evolution- 
ary innovations (Haldane 1932; Ohno 1967). It has been 
suggested that in several cases the modification of this 
raw material has been important in the evolution of key in- 
novations, such as glucose fermentation in yeast (Piskur 



2001) and the immune system of vertebrates (Kasahara 
2007). 

Despite the large effort put into the analysis of genome 
duplications, the identification and confirmation of such du- 
plications, especially ancient ones, has proved problematic. 
To conclusively demonstrate a paleopolyploidy event, sev- 
eral complete genomes must be sequenced both from taxa 
that are directly descended from the original polyploid indi- 
vidual and from their relatives that diverged shortly before 
the WGD (Wong et al. 2002; Woods et al. 2005). Even to 
identify likely cases of paleopolyploidy large numbers of 
genes must be sequenced in several closely related taxa 
(Spring 1997; Blanc and Wolfe 2004), and although se- 
quencing costs continue to decline, it remains costly in terms 
of laboratory time and materials (Ansorge 2009). 

The comparison of lineages that diverged shortly before 
and after the WGD is critical to the accurate reconstruction 
of the event (Scannell et al. 2007) (fig. 1 ). Thus, it would be 
beneficial to identify putative positions of a WGD before ini- 
tiating research into the effects of paleopolyploidy. Here, we 
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Fig. 1. — An ideal sampling of tinree taxa for investigating the 
genome duplication indicated by the dash. By selecting one taxon each 
from clades A, B, and C, the investigator would minimize the amount of 
shared history either before or after the event and thus have the best 
chance of accurately reconstructing the effects of the duplication on the 
genome. 

show that by using karyotypic data we are able to increase 
taxon sampling to accurately identify the branches on which 
the duplication occurred and thus guide the selection of taxa 
for genome sequencing. 

Although it is possible to identify clades within a phylog- 
eny that have approximately twice the number of chromo- 
somes as their close relatives, such distributions of 
chromosome numbers may be the consequence of 
a WGD or of ordinary processes of aneuploidy In order 
to distinguish the background rate of change in chromo- 
some number from doublings caused by WGDs, we used 
a likelihood method in which the background rate of chro- 
mosome number evolution is modeled as a birth-death pro- 
cess. The birth-death process is a stochastic model often 
used to compare the numbers of genes in gene families be- 
tween organisms (Lynch and Conery 2003; Hahn et al. 
2005; Novozhilov et al. 2006). Under this process, the rate 
of chromosome duplication or loss is proportional to the 
number of chromosomes in the genome. This seems an ap- 
propriate model as aneuploidy is usually a consequence of 
nondisjunction; thus both gains and losses are likely to occur 
at approximately equal rates for any random chromosome. 

We used a model in which chromosome number evolved 
by the birth-death process alone as a null model and com- 
pared it with a duplication model in which the number of 
chromosomes could also double in a single stochastic event. 
The maximum likelihoods for a set of chromosome counts 
were calculated under each model and compared with sim- 
ulations in order to determine whether the birth-death pro- 
cess alone was a reasonable fit for the data, whether the 
duplication model was a significantly better fit than the 
birth-death process alone, and whether the birth-death 
process was a sufficient approximation of the background 



rate of chromosome number evolution in the presence of 
WGDs. Furthermore, the parameters inferred under the du- 
plication model were used to calculate the posterior prob- 
ability that a duplication occurred on each branch of the 
tree. Mayrose et al. (2010) recently published a similar likeli- 
hood method. They used models in which the rates of chro- 
mosome increase and decrease were independent of the 
number of chromosomes and models in which those rates 
were linearly related to the number of chromosomes; the 
birth-death process is a special case of the latter model. 

Here, we analyze the Mollusca for the occurrence of 
WGDs. The Mollusca is a large and disparate clade with 
members that play an important role in marine and terres- 
trial ecosystems. Furthermore, they are one of the most di- 
verse groups within the Lophotrochozoa, the least studied 
of the three major clades of bilaterian animals. Currently, 
genomics studies in Mollusca are still preliminary with fewer 
than five genomes sequenced or in progress (Chapman et al. 
2007). Natural polyploid species of molluscs have been rec- 
ognized for decades (e.g., Patterson 1969; Goldman et al. 
1 983), and it has been suggested that the Stylommatophora 
(Steusloff 1942) and the Neogastropoda (Hinegardner 
1 974) are polyploid. By determining which molluscan clades 
are polyploid, we can identify potential taxa for investigation 
and help to guide future research into molluscan genomics. 

Materials and Methods 

Phylogeny 

No single phylogenetic study has included all the taxa in our 
analysis, so we used phylogenies from several sources to as- 
semble a tree for the Mollusca and then pruned taxa for 
which we did not have chromosome data. Although a tree 
constructed from a single analysis would have been pre- 
ferred, such an analysis is beyond the scope of this study 
We therefore used a reverse compartmentalization method 
to generate the tree used here (Mishler 1994; Nixon et al. 
1994). We did this by identifying published trees that rep- 
resented the terminal taxa in Haszprunar's (2000) molluscan 
phylogeny and in Ponder and Lindberg's (1997) gastropod 
phylogeny We assumed that each of these terminal taxa 
represented a compartmentalized (or collapsed) monophy- 
letic tree and therefore by placing the published trees at 
each of the respective tips we were simply expanding the 
terminal taxon. Mishler (1994) has shown this method is ro- 
bust in representing diverse yet clearly monophyletic clades 
in larger scale cladistic analyses, and we simply reversed this 
approach by expanding rather than collapsing each terminal 
clade using a published phylogeny Trees for the Polyplaco- 
phora (Okusu et al. 2003), Bivalvia (Giribet and Wheeler 
2002), and Cephalopoda (Lindgren et al. 2004) were then 
placed on the Haszprunar backbone. Within the gastropod 
tree. Barker (2001) was used for the phylogeny within the 
Heterobranchia and Wade et al. (2006) for more detailed 
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relationships annong the pulnnonate families. Wagele et al. 
(2008) provided branching patterns within the Opisthobran- 
chia, and Colgan et al. (2007) provided placement for ad- 
ditional families not included by Wagele et al. (2008). 
Pruning of trees in our reconstruction was done when chro- 
mosomal data for the taxa included in the original phyloge- 
netic analyses were not available, as the method we 
employed to calculate WGDs required chromosomal data 
for every taxon. The presence of additional data may affect 
the results, as it may for any scientific analysis. 

All likelihood models require branch lengths to estimate 
the probability of a transition along a given branch, and any 
rates used in such models are proportional to the units of 
branch length. Therefore, we assumed that under a null 
model the rates of chromosome duplication and loss were 
constant with respect to time and therefore the branch 
lengths for the tree should be in years between speciation 
events. We deduced the timing of each node in our tree by 
identifying the first occurrence for each of the terminal taxa 
in our study as well as for larger clades containing one or 
more of these taxa. These dates were determined from 
the Paleobiology Database (www.paleodb.org) with the fol- 
lowing exceptions: Nishiguchi and Mapes (2008) provided 
first occurrence dates for the cephalopods, and Solem 
and Yochelson (1979) and Zilch (1959-1960) were used 
for the pulmonate gastropods. Every node was fixed at 
the oldest date for the first appearance of any of its descend- 
ants. This method resulted in several branches of length 
zero. 

We could not assign a length to a terminal branch if it led 
to one of two sister taxa with no fossil occurrence nor to an 
internal branch if one of the two clades immediately de- 
scended from it had a longer fossil history than the internal 
branch's sister clade. These discrepancies likely resulted from 
the vagaries of the fossil record and do not represent the 
actual time between lineage splitting events. Therefore, 
we assumed that each node had preceded both its daughter 
nodes by at least some minimum number of years. Three 
trees were constructed with zero length branches 
expanded to 10^ 10^, and 10® years, and we will refer 
to these as Tree-10^ Tree-10^, and Tree-IO'", respectively. 

Chromosome Number 

Karyotypic data and gastropods were mined from the com- 
prehensive reviews of Patterson (1 969), Patterson and Burch 
(1 978), Nakamura (1 986), and Thiriot-Quievreux (2003). We 
derived karyotype data for the bivalves from Patterson 
(1969), Nakamura (1985), and Thiriot-Quievreux (2002). 
Karyotype data sources for the Cephalopoda included Na- 
kamura (1 985), Gao and Natsukari (1 990), and Vitturi et al. 
(1990). Polyplacophoran data was compiled from Odierna 
et al. (2008) and chromosome numbers for the Scaphopoda 
from leyama (1993). When karyotype data for multiple 



species within a terminal taxon was found, we used the 
mode for that taxon (White 1973). If there were multiple 
modes, we used the mode closest to the median, and if 
there were two modes equally distant from the median, 
we chose one at random. 

Phylogenetic Signal 

To test for phylogenetic signal, we randomized the karyo- 
type data among the tips 9,999 times and calculated the 
number of steps for the randomized data sets using or- 
dered parsimony and a parsimony model for which the 
cost for going from m chromosomes to n chromosomes 
along a branch is the absolute value of ln(n) - ln(m). Un- 
der this second parsimony model, the cost of adding or 
losing chromosomes is proportional to the number of 
chromosomes in the genome (as it is under the birth- 
death process) but it still retains the quality of an ordered 
parsimony model because the cost of going from state x to 
state z equals the cost of going from state x to state y plus 
the cost of going from state y to state z, when y is inter- 
mediate in value between x and z. We then compared the 
number of steps for our randomized data sets to the num- 
ber of steps for our actual data. Because there should be 
fewer steps between closely related individuals than dis- 
tantly related ones, there will be fewer steps under either 
parsimony model for a data set with high phylogenetic 
signal compared with a random data set. 

Likelihood Model 

For our null hypothesis, we assumed that each chromosome 
has an equal probability of duplicating or splitting and an 
equal probability of being lost at any time. Thus, there is 
a constant duplication rate, X, and loss rate, |i, for each chro- 
mosome in the genome. Hallinan (2011) showed how to 
calculate the probability that m,- chromosomes at the begin- 
ning of branch / of the taxon tree will leave A/, chromosomes 
at the end of that branch, assuming that all m, 
chromosomes survive to the end of that branch. 



Pi(A/,|m,) 



N, 
mi 



1 



(1 



where 



[• exp((>.-n)f,)-1 
: J exp((>.-n)t/)-a 
I 

[ kti + 1 



if?^ ^ |I 
if?^ = |X 



m, Nj - nij 



f/ is the length of branch / and a=^/X. We also know from 
Kendall (1948) that if f, is the probabilitythat a chromosome 
present at the base of branch / is lost by the end of that 
branch, then f,=aL;,. 

Using this result, we can now calculate the probability 
that /W, chromosomes at the beginning of the branch leave 
Nj chromosomes at the end of branch /, allowing for the 



1152 



Genome Biol. Evol. 3:1150-1163. doi:10.1093/gbe/evr087 Advance Access publication August 22, 201 1 



Molluscan Paleopolyploidy 



GBE 



possibility that any of the initial chromosomes could be lost. 
In this case, m, is the number of chromosomes in the initial 
group of Mi chromosomes that give rise to the A/,- chromo- 
somes at the end of the process. J^e other /W, - m, chro- 
mosomes are lost. There are ( ' ) different ways to 
arrange /W, - m, lost chromosomes ^mong M, initial chro- 
mosomes. Therefore, we can calculate this probability by 
summing over all the possible numbers of surviving 
chromosomes. 

min(M/,A/,) / n/i \ 

P2{N,\M,)= E rT' fr'-^'PiWim,) 

m, = 1 V'"'/ 

min(M„A/,) / /w. _ -i \ (2) 

- (rOft -',)[<'-».)(■ -".r 

x{au,f'-'"'uf-'"'. 

This equation is equivalent to the one given by Bailey (1 964, 
p. 94) and Foote et al. (1999), although it takes a different 
form, and here, we provide an alternative derivation. We 
can assume that all the chromosomes were not lost on 
any branch of the tree, so that the transition probability be- 
tween any two states M, and A/,- is the probability of going 
from M/ to A// chromosomes along branch / of length f,-, given 
that A/, is at least one. 

"Mm)-'^. (3) 

We used this equation to calculate the probability of a set 
of chromosome counts at the tips of a phylogeny condi- 
tioned on a particularvaluefor and the number of chro- 
mosomes at the root. We calculated the likelihood for the 
entire tree by proceeding down from the tips of the tree to 
the root and marginalizing over all the possible states at the 
internal nodes (Felsenstein 1973, 1981), such that we cal- 
culated the probability of the data above branch / given 
Mj by summing over all the possible values of A/,-. 

GO 

P{q\M,) = J2 P,(A/,|H)P(Q|A/,), (4) 

W, = 1 

where p,(A/,|M,) is the transition probability of going from M,- 
to Nj chromosomes and C, is the chromosome counts on the 
tips above branch /. If we define /+ and ;~ as the two 
branches immediately descended from /, then 
Ni=Mi+=Mr and P(Q|A/,) can be calculated as 
P{Ci+\Mi+)P{Cr\Mi^), which can each in turn be calculated 
with equation (4). We also calculated /-/,(X), the highest in- 
teger such that P{Ci\Ni=Hi{X))>P{Q\Ni=Hi{X) + ^) and 
exp(X)xP(C,|A/,=H,(X))>X]„,^n,P(C,|A/,) where is the 
set of all the values of A/,- for which P(C,|A/,) has already been 
calculated. As there are an infinite number of possible states 
at each internal node, we did not calculate probabilities for 
values of A/, that were greater than 200 or greater than 



H/+(5), H,-(5), and H,(15). We ran multiple tests in order 
to confirm that these values were set conservatively and 
did not result in lower likelihoods. In this case, we assumed 
that all evolution of chromosome numbers occurred by the 
birth-death process, so that p,(A/,|/W,)=P3(A/,|M,). 

We added WGDs to this model by assuming that they 
occur at some constant rate, 8, and that no more than 
one WGD occurred on any branch. We call the number 
of genome duplications that happened on branch /, D,-. 
The transition probability of going from M, to A/,- chromo- 
somes along a branch with no full genome duplications is 
simply the transition probability from the birth-death pro- 
cess times the probability that no genome duplication 
occurred. 

P^Ni, D, = Oj/W,) = exp(- 5f,)P3(A/,|/W,). (5) 

In order to calculate the transition probability of going 
from Mi to A/, chromosomes along a branch with one full 
genome duplication, we divide the branch into two discrete 
time periods at the time of the duplication, fg. In that case, 
the number of chromosomes before a duplication is A/g and 
the time between the start of the branch and the duplication 
is tj - U; after the duplication, there will be 2A/g chromo- 
somes. We can calculate the transition probability by sum- 
ming over all the possible values of A/g and integrating over 
fg from zero to f,. 

P4(/V„D = 1|/W,) =8exp(-8f,) 

X /o E P3(/Vg|/W,)P3(/V,|2A/g)dfg. 

/V5 = 1 

We calculated this integral using numerical integration, 
breaking each branch into sections and assuming that 
the duplication happened in the middle of that section. 
We used three sections, as we found that three 
sections yielded the same results as ten sections. We calcu- 
lated the sum by treating the duplication event as an internal 
node and limiting the number of possible states as above. 

Thus, we can calculate the likelihood of specific values of 
X, and 8 conditioned on the numbers of chromosomes at 
the root using equation (4), except we now calculate the 
transition probability as p,(A//|/W,)=P4(/\/,|/W,), where: 

P4(A/,|/V/,) « P4(A/„ D, = 0\Mi) + P4(A/,-, D, = 1 |/W,). (7) 

This excludes the possibility of multiple duplications on any 
branch, which is in general very small. These calculations 
were performed using the program GDCN 1.0. Windows 
executables and code are available from http://code. 
google. com/p/gdcn/. 

Model Comparison 

We compared three different models by fitting the 
parameters of those models to our data set by maximum 
likelihood. The "static" model is the simplest model and 
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assumes that X=[i and that 5=0. In order to show that our 
data are better explained by including WGDs, we compared 
the static model to a "duplication" model in which 5 is al- 
lowed to take values greater than 0. It is possible that the 
superior fit of the duplication model is a consequence of 
a strong tendency for the number of chromosomes to in- 
crease rather than to paleopolyploidy events. To account 
for this possibility, we also compared these two models 
to the "trend" model that allows X and |.i to take different 
values but assumes that 5=0. We examined all three models 
on all three trees in order to assure that no set of branch 
lengths was unduly affecting our conclusions. 

Our calculation of the likelihood is conditioned on the 
number of chromosomes at the root. Because we do not 
know the actual number of chromosomes in the common 
ancestor of all molluscs, we treated the ancestral chromo- 
some count as another parameter to be set by our maximum 
likelihood search. In order to justify this approach, we ini- 
tially calculated maximum likelihoods conditioned on 
a range of ancestral chromosome values from 1 to 80. By 
comparing a large range of ancestral mollusc chromosome 
counts, we could see how strongly our model supported the 
maximum likelihood reconstruction of the root and be cer- 
tain that our model choice was not overly biased by our 
reconstruction of the ancestral chromosome count. 

The maximum likelihoods for all models under all sets of 
root values were used to calculate the Akaike Information 
Criterion (AlC) for those models. The AlC is defined as twice 
the number of parameters in the model minus twice the max- 
imum likelihood. The AlCs of the different models were com- 
pared in order to select the best fit model for the data. Models 
that fit the data better have lower AlCs (Akaike 1 974). 

Identifying Branches with Duplications 

In order to identify branches on which it is likely that a WGD 
occurred, we calculated the posterior probability that there 
was a duplication on each branch of the tree. For each 
branch, /, we used the maximum likelihood values of X, ii, 
5 and the ancestral chromosome count to calculate two like- 
lihoods: P{Cr,Di=^Nr) and P{Cr,D;=Q\Nr), where branch r 
is the root of the tree. To make these calculations, we assume 
thatp^(A/y|My) is P4{Nj\Mj) on every branch of the tree except 
/■ where p/(A//|/W,) is P4(A//, D, = 1 |/W,) or P4(A/,-, D,=0|M/), ac- 
cording to which likelihood we want to calculate. In that case, 
P{Cr\Nr) « P(Q, D| = ^\Nr) + P(Q, D,=0|/Vr) and the poste- 
rior probability of a duplication on that branch will be 
PiDi=^\Cr,Nr)= P{Cr,Di=]\Nr)/P{Cr\Nr). These values 
were calculated for all three trees, using the maximum 
likelihood values for X, \i, 5, and the root. 

Simulations and Model Fit 

In addition to demonstrating which of these three models 
fits the data better, it is also necessary that these models 



provide a reasonable deschption of the data themselves. 
In order to make this comparison, we simulated 1 ,000 data 
sets on Tree-1 0^ under the static and the duplication model 
using maximum likelihood parameter values. Maximum like- 
lihoods were then calculated for each of these data sets un- 
der the models used to generate the data. These maximum 
likelihoods were then compared with the maximum likeli- 
hood of our data to potentially reject the hypothesis that 
the data were generated by this model. Failure to reject this 
hypothesis was taken to indicate that the model was 
a reasonable approximation of the actual process by which 
the real data were generated. The data sets simulated 
under the static model were also evaluated under the 
duplication model, and the difference between the max- 
imum likelihoods under each model for each data set were 
compared with our data set in order to confirm that 
the duplication model was a significantly better fit for 
the data. 

We also considered it important to demonstrate that the 
birth-death process was an appropriate model for the 
background rate of chromosome number evolution inde- 
pendently of where any WGDs occurred. To do this, we 
simulated 1 ,000 data sets on Tree-1 0^ using the maximum 
likelihood values for the root count, X and \i calculated un- 
der the duplication model. Under these simulations, WGDs 
did not occur at random but instead always occurred on 
the branches for which our actual data showed a high pos- 
terior probability of a WGD. In order to test the effects that 
a duplication on a minimum length branch would have on 
the data set, we also simulated 1 ,000 data sets using max- 
imum likelihood values for the root count, X, and 5 on 
Tree-1 0^ and placed one duplication at random on a min- 
imum length branch. We calculated maximum likelihoods 
for these simulated data sets using the duplication model 
and compared those likelihoods with the one calculated for 
our actual data in order to potentially reject the use of the 
birth-death process for our background rate of chromo- 
some evolution. 

Results 

Phylogeny, Chromosome Counts, and Signal 

The topology for the phylogeny we used is shown in 
figure 2a. Most traditional molluscan clades are monophy- 
letic in this tree but both Caenogastropoda and Sigmurethra 
are paraphyletic. Figure 2b shows the same tree with the 
branch lengths scaled to match the branch lengths we derived 
from paleontological data. In this second tree, 68 of 122 in- 
ternal branches had zero branch lengths, as a consequence of 
basal branching taxa in a clade having a younger fossil occur- 
rence than taxa nested within the clade, and 6 of 1 24 terminal 
branches also had zero branch lengths because two sister taxa 
had no fossil occurrences. Zero length branches are shown as 
polytomies on the tree. 
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Fig. 2. — (a) The cladogram of the Mollusca used in this study. Several clades important to this study are identified with brackets, (b) The phylogeny 
of molluscan taxa used in this study The topology of this tree is the same as the tree in (a), but here branch lengths are shown in millions of years as 
derived from the fossil record using the Paleobiology Database. Zero length branches are shown as polytomies. 



Chromosome counts were obtained for 997 molluscan 
species within 124 terminal taxa and included members 
of five major extant classes (Polyplacophora, Bivalvia, Ceph- 
alopoda, Scaphopoda, and Gastropoda) (supplementary ta- 
ble, Supplementary Material online). Chromosomal data 
were unavailable for the Monoplacophora and Aplacophora 
(Chaetodermomorpha and Neomeniomorpha). When mul- 
tiple values were available for a terminal taxon, the mode of 
these chromosome numbers was used to represent the ter- 
minal clade in our phylogenetic analysis. In most cases, all 
the chromosome counts in each terminal clustered around 
the modes, but several clades have members with highly 
divergent counts. For example, within the Anomioidea, 



Loliginidae, Viviparoidea, Thiaridae, Ancylidae, and Planor- 
bidae, there are taxa with chromosome counts two times 
the mode for the entire clade implying recent polyploidies. 
In the Planorbidae genus Bulinus, there are species with 
three and even four times the mode of the clade. Further- 
more, within the Unionoida, Cardioidea, Turbinoidea, Cer- 
ithiidae, Pleuroceridae, Littorinidae, Muricidae, Conoidea, 
and Succineidae, there are species with chromosome 
counts approximately half of the mode. Most of these 
clades are nested in larger clades with modes similar to 
their own. 

Chromosome number has very high phylogenetic signal 
among the molluscan taxa studied, as demonstrated by two 
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Table 1 

Maximum Likelihood Parameter Values for All Three Models and Sets of Branch Lengths 



Branch Lengths 


Model 


AlC 


Root Count 


1+ 


X — [I 


5 


Tree- 10*^ 


Equilibrium 


364.540 


17 


5.0359 


0" 


0= 




Trend 


366.504 


16 


4.9772 


0.06851 


0" 




Duplication 


357.508 


15 


3.0555 


0= 


0.12304 


Tree- 10= 


Equilibrium 


366.792 


17 


5.2310 


0= 


0= 




Trend 


368.719 


20 


5.4529 


-0.34818 


0" 




Duplication 


361.744 


15 


3.4038 


0" 


0.11661 


Tree- 10" 


Equilibrium 


369.134 


17 


5.2577 


0^ 


0= 




Trend 


371.043 


20 


5.4787 


-0.34931 
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different parsimony statistics we derived from our data set 
and compared with the same statistics calculated for data 
sets generated by randomizing our data over the tips of 
the trees. Our data set had 250 steps for ordered parsimony 
and 13.36 steps for our weighted step matrix. Both these 
values were highly significant (P << 0.0001), as the 
9,999 randomized data sets had from 558 to 749 steps 
for ordered parsimony and from 27.49 to 37.12 steps for 
our weighted step matrix. 

Model Choice 

The duplication model achieved its maximum likelihood at 

1 5 ancestral chromosomes for all three trees, whereas the 
static model achieved its maximum likelihood at 1 7 (table 1 ). 
For both these models, likelihoods fell off precipitously for all 
trees when conditioned on ancestral chromosome counts 
that differed from their maximum likelihood chromosome 
count by more than 2 (fig. 3). The trend model reached 
its maximum likelihood at chromosome counts between 

1 6 and 20 for the three different trees and although the like- 
lihoods decreased precipitously for lower ancestral chromo- 



some counts, they decreased more gradually for higher 
counts. 

The AlC was much lowerforthe duplication model than it 
was for either other model when conditioned on ancestral 
chromosome counts near the maximum for all three trees 
(fig. 3). Furthermore, when the ancestral chromosome 
count was treated as an additional maximum likelihood pa- 
rameter the AlC was much lower for the duplication model 
than it was for either other model (table 1). However, the 
trend model achieved relatively high maximum likelihoods 
over a much larger range of ancestral chromosome counts 
and when conditioned on more extreme chromosome 
counts it had lower AlCs than either of the other two models 
(fig. 3). Nevertheless, the duplication model isclearly a better 
fit for the data as its AlCs were so much lower for all three 
trees when the ancestral chromosome counts were fit by 
maximum likelihood. 

There was little variation in the maximum likelihood es- 
timates of parameter values between the three sets of 
branch lengths (table 1). Maximum likelihood estimates 
for the total rate of change for the birth-death process were 



Tree -10^ Tree -10^ Tree -10" 




0 10 30 50 70 0 10 30 50 70 0 10 30 50 



Chromosomes in Ancestral Mollusc 

■ ■ Trend Model 
— Equilibrium Model 
Duplication Model 

Fig. 3. — The AlC for each of three models of chromosome number evolution in extant mollusc families. Each figure shows the AlCs for all three 
models conditioned on the number of chromosomes found in the last common ancestor of all Mollusca for a given set of branch lengths. 
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Fig. 4. — The phylogeny of all the terminal taxa in the Hypsogastropoda 
using the branch lengths from Tree-10^ showing the posterior probability of 
a WGD on each branch. The Capulidae-Neogastropoda branch is marked, as 
it has a very high posterior probability of a WGD for all sets of branch lengths, 
whereas all the other branches have essentially none. 

much smaller under the duplication model than under the 
other two models because under the duplication model 
a great deal of the change in chromosome numbers can 
be accounted for by WGDs. The estimate of net change 
in chromosome number under the trend model was slightly 
positive for tree-10^; the estimate of net change was neg- 
ative for the other two trees but still only represented 6.4% 
of the total rate. Our estimates of the WGD rate under the 
duplication model ranged from 0.109 duplications/billion 
years for Tree-10^ to 0.123 duplications/billion years for 
Tree-10^. It should be noted that this is the WGD rate that 
would be expected per lineage not over the whole tree; 
thus, we observed several WGDs in a clade with a 530 
million year history. 

Branches with Duplications 

Three branches had posterior probabilities of WGDs greater 
than 0.67 for at least one set of branch lengths when con- 
ditioned on their maximum likelihood parameter values. 
These included the branch at the base of the coleoid ceph- 
alopods, a branch within the stylommatophoran gastropods 
at the base of a clade containing the Sigmurethra and the 



Orthurethra and a branch within the hypsogastropods at the 
base of an unnamed clade. Several branches phylogeneti- 
cally close to some of these we 1 1 -sup ported branches also 
showed some support for a WGD (figs. 4-6). No other 
branch had a posterior probability greater than 0.03 for 
any of the trees. 

The branch within the Hypsogastropoda at the base of 
a clade sister to the Strombidae and containing the Neogas- 
tropoda and several families of Littorinimorpha — hereafter 
to be referred to as the Capulidae-Neogastropoda 
branch — had posterior probabilities of a WGD greater than 
0.999 for all three trees (fig. 4). This is extremely strong 
support for an evolutionary scenario in which the number 
of chromosomes doubled on this branch. 

We calculated the posterior probability for a WGD on 
a branch within the Stylommatophora at the base of a clade 
containing the Sigmurethra and the Orthurethra of 0.881 
for Tree-1 0^ (fig. 5). In contrast, Tree-1 0^ had much less sup- 
port for a duplication on this branch and Tree-1 0^ had even 
less. Tree-1 0^ and Tree-1 0^ alternatively showed some sup- 
port for a WGD on the branch at the base of the Stylomma- 
tophora, whereasTree-10^ weakly supported a WGD on this 
branch. There was no support for WGDs occurring on both 
these branches; the posterior probability was less than 1 x 
1 0"^ for all sets of branch lengths. The Sigmurethra-Orthur- 
ethra branch has a minimum branch length, and as a conse- 
quence, the likelihood of a duplication on this branch was 
less on trees with smaller minimum branch lengths. This 
greatly decreased the support for a doubling on this branch 
and instead compensated in part by increasing the support 
for a doubling on the ancestral stylommatophoran branch, 
which is 80 My long and thus much more likely to have 
a WGD on it. This appears to be an effect of our branch 
lengths, thus the analysis on Tree-1 0®, which is less affected 
by branch lengths, is likely more robust and suggests that 
a paleopolyploidy event occurred on the Sigmurethra-Or- 
thurethra branch. 

Our analysis clearly supports at least one paleopolyploidy 
event in the Cephalopoda, but it was difficult to distinguish 
whether there were one or two WGDs and the exact 
branch on which they occurred (fig. 6a). All three sets of 
branch lengths had a posterior probability of a WGD on 
the Coleoidea branch between 0.675 and 0.687 and on 
the Decapodiformes branch between 0.288 and 0.303 
(fig. 6b), with weak support for duplications occurring on 
both of the branches (posterior probability between 
0.030 and 0.046), meaning that the posterior probability 
that a WGD occurred on one of these branches was greater 
than 0.93 for all three sets of branch lengths. These trees 
also had posterior probabilities less than 0.09 that 
a WGD occurred either at the base of the Cephalopoda 
or on the Nautilidae branch with an overall posterior prob- 
ability between 0.078 and 0.091 that a WGD occurred on 
more than one branch in this clade. 
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Fig. 5. — Posterior probabilities for a WGD on two branciies near 
the base of the Stylommatophora. (a) The phyiogeny of aii the terminal 
taxa in a clade containing the Stylommatophora, the Systellommato- 
phora, and the Ellobiidae using the branch lengths from Tree-10^ 
showing the posterior probability of a WGD on each branch. The 
Stylommatophora branch and the Sigmurethra-Orthurethra branch are 
marked, as they have a relatively high posterior probability of a WGD, 
whereas all the other branches have essentially none, (fa) A bar plot 
showing the posterior probability of a WGD on either of these branches 
as well as on each of these branches individually under each of the three 
different sets of branch lengths. 
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Fig. 6. — Postehor probabilities for a WGD on branches within the 
Cephalopoda, (a) The phyiogeny of all the terminal taxa in the 
Cephalopoda with Gastropoda and Scaphopoda included as an 
outgroup using the branch lengths from Tree-10^ showing the posterior 
probability of a WGD on each branch. Two branches with relatively high 
posterior probability of a WGD are labeled above the branch, (fa) A bar 
plot showing the posterior probability of a WGD on various branches in 
the phyiogeny The dark bars show the posterior probability of a WGD 
on the specified branch or branches. The light bars show the posterior 
probability that there were WGDs on any pair of branches in the 
Cephalopoda including the specified branch or branches. The posterior 
probability is shown by the large bar for Tree-1 0^, by the lower error bar 
for Tree-1 0*, and by the upper error bar for Tree-1 0^. 



Simulations to Evaluate Model Fit 

We ran 1,000 simulations on Tree-1 0® under tlie static 
model using our maximum lil<elihood parameter values 
for a process conditioned on 1 7 ancestral mollusc chromo- 
somes. When evaluated under the static model, 160 of 
these simulations had maximum likelihoods lower than 
our data set, implying that we cannot reject this model 
as an explanation for our data (P = 0. 1 60). We also calcu- 
lated the maximum likelihood for all simulated data sets 
under the duplication model and used that value to calcu- 
late a log likelihood ratio. The largest log likelihood ratio for 
our simulated data sets was 5.506 and the likelihood ratio 
for over 96% of our simulations was less than 0.001, 
whereas the likelihood ratio for our actual data was 
18.064. Thus, we can strongly reject our null hypothesis 
in favor of our alternative hypothesis of genome duplica- 
tions (P << 0.001). 
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We also ran 1,000 simulations on Tree-10^ under the 
duplication model using the maximum likelihood parameter 
values for a process conditioned on 15 ancestral mollusc 
chromosomes. When evaluated under the duplication 
model, 40 of these simulations had maximum likelihoods 
less than our actual data, indicating that we can tentatively 
reject our alternative hypothesis as a fit for our model (P = 
0.040). 

However, one of the reconstructed genome duplications 
from our actual data appears to have occurred on a mini- 
mum length branch. Given the estimated value for 8 of 
0.123 WGD/billion years, the probability of a WGD on so 
short a branch is approximately 1.23x1 0"^. Thus, any data 
set with a duplication on a minimum length branch will ob- 
viously have a lower maximum likelihood than one without. 
Given this estimated value of 8, we would expect WGDs to 
occur on a minimum length branch in 0.943% of our sim- 
ulations, and indeed, they actually did occur in only nine of 
our simulations. Because the vast majority of our simulations 
did not have a duplication on a minimum length branch, we 
would expect them to have higher likelihoods than our ac- 
tual data. Indeed, 1 66 of 1 ,000 simulations in which at least 
one duplication occurred on a minimum length branch had 
a maximum likelihood lower than our data set. Furthermore, 
251 of 1,000 simulations in which duplications always oc- 
curred on the Capulidae-Neogastropoda branch, the Co- 
leoidea branch, and the Sigmurethra-Orthurethra branch, 
and nowhere else had maximum likelihoods less than our 
actual data set. Thus, we could not reject the birth-death 
process as a model for our background rate of chromosome 
evolution. 

Discussion 

The likelihood model developed here and based on the 
birth-death process allows us to predict the phylogenetic 
position of paleopolyploidy events through comparative 
analysis of chromosome counts in extant species. When ap- 
plied to a data set of chromosome numbers for the Mollus- 
ca, we found that a scenaho in which the total number of 
chromosomes occasionally doubled explained these data 
better than a model in which chromosome number only 
evolved via the birth-death process (fig. 3). Furthermore, 
simulations indicated that the birth-death process was 
a poor fit for our chromosome number data when used 
alone, but a reasonable fit for the background rate of chro- 
mosome number evolution in the presence of WGDs. We 
identified three potential instances of paleopolyploidy (figs. 
4-6). In one case, we could clearly identify the branch on 
which the WGD occurred; in the other two cases, we could 
narrow down the position of the WGD to one of two 
branches. Based on the assumptions inherent to our model, 
support for WGDs within the Mollusca in general and in 
these three clades in particular is substantial. 



In the Caenogastropoda, comparative analysis of chro- 
mosome counts suggests that a WGD occurred in the com- 
mon ancestor of a clade containing the Capulidae, the 
Ranellidae, the Cypraeidae, and the Neogastropoda after 
their divergence from the Strombidae and the other hypso- 
gastropod families included in our analysis (fig. 4). This pa- 
leopolyploidy event was strongly supported by all three sets 
of branch lengths (posterior probability > 0.999), and there 
was no support for a WGD on any other branch in the Hyp- 
sogastropoda. Our interpretation of the fossil record indi- 
cates that this WGD occurred at some point between the 
beginning of the Jurassic (203 Ma) when the first Strombi- 
dae fossils appear and the lower Cretaceous (1 55 Ma) when 
the Neogastropoda initially radiated. Our results essentially 
agree with Hinegardner's (1974) suggestion that the neo- 
gastropods arose by polyploidy based on their genome size 
and chromosome number, although we inferred that this 
WGD occurred slightly earlier. If a WGD did occur at the base 
of the Neogastropoda, our phylogeny would imply two ad- 
ditional independent WGDs at the base of Capulidae and 
the Cypraeidae-Ranelidae. Considering the overall low rate 
of genome duplication in our analysis, such a scenario is 
highly unlikely However, chromosome counts from groups 
not represented in this analysis (e.g., Ficidae and Cassidae) 
may affect this result. 

We identified another likely WGD early in the history of 
the Stylommatophora: either at the beginning of the Ce- 
nozoic (65 Ma) in the common ancestor of the Sigmurethra 
and the Orthurethra after they diverged from the Succinei- 
dae or in the common ancestor of all the Stylommatophora 
after they diverged from the other pulmonates in the lower 
Cretaceous (138 Ma) and before they radiated at the be- 
ginning of the Cenozoic (65 Ma) (fig. 5a). Steusloff (1942) 
had also suggested a paleopolyploidy in the Stylommato- 
phora, although we were able to more accurately locate 
the event. We could not establish a length for the 
Sigmurethra-Orthurethra branch or many of the branches 
immediately descended from it, as several extant Stylom- 
matophoran families appear at the beginning of the Ceno- 
zoic and the phylogenetic position of Stylommatophora 
fossils appearing before then is uncertain (Zilch 1959- 
1960; Solem and Yochelson 1979) (fig. 2). Therefore, 
we treated each of these branches as one branch of min- 
imum length within a relatively fast radiation. As a conse- 
quence, support for a paleopolyploidy event on the 
Sigmurethra-Orthurethra branch decreased for shorter 
minimum branch lengths, as the posterior probability of 
a WGD is proportional to the length of the branch. There 
was a concomitant increase in the posterior probability on 
the Stylommatophora branch, but it was not sufficient to 
compensate for the decrease on the Sigmurethra-Orthur- 
ethra branch, thus total support for a WGD on either of 
these branches decreased with shorter minimum branch 
lengths (fig. 5b). As the length of the Sigmurethra- 



Genome Biol. Evol. 3:1150-1 163. doi:10.1093/gbe/evr087 Advance Access publication August 22, 201 1 



1159 



Hallinan and Lindberg 



GBE 



Orthurethra branch affects the posterior probability of both 
branches and the maximum support on the Sigmurethra-Or- 
thurethra branch is higher than the maximum support for the 
Stylommatophora branch, we conclude that the lack of sup- 
port for WGDs on the Sigmurethra-Orthurethra branch is in 
fact an artifact of our method of assigning branch lengths, 
and that it is more reasonable to conclude that a WGD oc- 
curred there than on the Stylommatophora branch. It should 
be noted that the number of chromosomes was highly vari- 
able among the Succineidae taxa (supplementary table. Sup- 
plementary Material online); thus, it is possible that selecting 
a different chromosome number to represent the Succineidae 
would lead to different conclusions about the location of the 
paleopolyploidy event. A well-resolved phylogeny of the Suc- 
cineidae is necessary to better reconstruct the number of 
chromosomes in their last common ancestor but such a phy- 
logeny is not currently available. 

Our data suggested that a third WGD occurred within 
the Cephalopoda, although the exact location of this pa- 
leopolyploidy event is not clearly reconstructed (fig. 6a). 
All sets of branch lengths support a duplication in the 
common ancestor of the Coleoidea after they diverged 
from the Nautiloidea in the lower Ordovician (490 Ma) 
but before the Decapodiformes split from the Octopoda 
in the Carboniferous (300 Ma). On the other hand, there 
is also some support for a duplication occurring in the 
common ancestor of the Decapodiformes after they split 
from the Octopoda but before the Sepiidae and the Loli- 
ginidae split in the lower Jurassic (200 Ma). There is also 
weak support for multiple WGDs within the Cephalopoda 
either on both these branches or on some combination of 
these branches and other branches within the Cephalopo- 
da (fig. 6b). Inspection of chromosome counts within the 
Cephalopoda implies another scenario. The Octopoda 
have 30 chromosomes in their haploid genome on the 
other hand the Sepiidae and the Loliginidae both have 
46. Given that we reconstructed 15 chromosomes in 
the ancestral mollusc, this implies that the Octopoda 
are tetraploid, whereas the Decapodiformes are hexa- 
ploid. Our model does not account for changes in ploidy 
other than doublings and thus would be limited to iden- 
tifying tetraploids and octoploids. The hexaploid Decapo- 
diformes likely caused the uncertainty in reconstructing 
the WGD within the Cephalopoda. Although variation 
in chromosome counts within both the Sepiidae and 
the Loliginidae is overall very low, there are taxa in both 
families that have been reported to have many more chro- 
mosomes than the mode for the family (supplementary 
table. Supplementary Material online), such as Loligo plei 
(84) and Sepia officinalis (56). If these values are correct it 
is possible that the ancestral Decapodiformes had more 
chromosomes than are suggested by our reconstructions. 
This would increase support for a WGD on the Decapodi- 
formes branch. 



It has been suggested that WGDs can lead to large mor- 
phological and physiological innovations, as redundancy 
eases the constraints on genes throughout the genome 
(Haldane 1932; Ohno 1967). The three clades recognized 
here fit this proposition well. The Stylommatophora repre- 
sent the dominant group of land snails and slugs and have 
successfully made the difficult transition from an aquatic to 
terrestrial life style and all the associated anatomical and 
physiological changes that this transition requires (Barker 
2001; Mordan and Wade 2008). The coleoid cephalopods 
are also a large group with a diversity of body forms and 
ecologies and represent a large jump in anatomical, physi- 
ological, and behavioral complexity. They are characterized 
by an internalized or reduced shell, a muscular mantle for 
locomotion, chromatophores, ink sacs, an acute vision sys- 
tem, and complex and morphologically distinct arms (Nishi- 
guchi and Mapes 2008). Lastly, according to our 
reconstructions, the WGD event identified in the Caenogas- 
tropoda precedes the radiation of the Neogastropoda by on- 
ly a short time and may have played a major role in their 
diversification. Neogastropoda are a large and diverse clade 
that have undergone extraordinary radiations as seen in 
their anatomical, physiological, behavioral, and ecological 
diversity (Ponder et al. 2008). 

Although the WGDs suggested here are the only well- 
supported paleopolyploidies in the Mollusca that we are 
aware of, more recent polyploidies have long been recog- 
nized in many species of molluscs. Several polyploid species 
have been identified in the wild (e.g., Patterson and Burch 
1978; Barsiene et al. 1996; Park 2008) and in at least one 
case the origin of a polyploidy has been analyzed karyolog- 
ically (Goldman et al. 1983). Polyploidy has also been arti- 
ficially induced in several species of commercial molluscs 
(e.g., Beaumont and Fairbrother 1991; Yang and Guo 
2006; Le Pennec et al. 2007). Furthermore, polyploidy is 
common in the somatic tissue of many Molluscs (Anisimov 
et al. 1995; Tabakova et al. 2005; Tokmakova et al. 2006). 
Within several of the terminal taxa in our analysis, we rec- 
ognized members with approximately twice the number of 
chromosomes as the mode for that family, and these may 
also be a consequence of more recent WGDs (supplemen- 
tary table. Supplementary Material online). 

We observed a strong phylogenetic signal in mollusc 
chromosome number with no tendency for the number 
of chromosomes to increase or decrease. Some researchers 
had previously noticed that taxonomic groups within the 
Mollusca tended to have similar numbers of chromosomes 
(Patterson and Burch 1 978; Nakamura 1 985), as our analysis 
confirmed (P << 0.001). However, others have observed 
a tendency for chromosome numbers to either increase 
(e.g., Patterson and Burch 1978) or decrease (e.g., Butot 
and Kiauta 1969; Ahmed 1976). The maximum likelihood 
analysis presented here suggests that there is no trend in 
chromosome number, although the WGDs could be 
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construed as representing a positive trend. Vinogradov 
(2000) detected a general tendency to large genome sizes 
in terrestrial pulmonates. However, as his analysis was not 
phylogenetic, the signal may be a consequence of large 
chromosome counts in the Stylommatophora. 

IVlayrose et al. (2010) described a similar method to de- 
duce the phylogenetic location of paleopolyploidies and 
used a maximum lil<elihood implementation of it to analyze 
chromosome counts from several plant taxa. The theory be- 
hind both methods is essentially the same, although they 
established a rate matrix and approximated the exponent 
of that rate matrix to calculate the transition probabilities. 
Thus, they were able to easily use a wider variety of models 
including models similar to those used here and models in 
which the rates of chromosome increase and decrease were 
independent of the number of chromosomes, which was 
a slightly better fit for their data. They were also able to in- 
clude rates for genome doubling, genome halving, and ge- 
nome increasing by 50% and to exclude the possibility of 
reaching zero chromosomes from their models. Our re- 
search was almost complete at the time that IVlayrose 
et al. (2010) was published, and so our analysis does not 
incorporate their methods, but the two methods are very 
similar and likely to yield the same results. Here, we estab- 
lished an elaborate hypothesis testing framework using sim- 
ulations to confirm the fit of our model to the data set, 
which could easily be incorporated into the method 
described by Mayrose et al. (2010). 

The methodology we have presented here is likely appli- 
cable to other taxa provided that the background rate of 
chromosome number evolution is low enough relative to 
the time spans involved. However, background rates of 
change in chromosome number in the Mollusca and, partic- 
ularly, in gastropods may be unusually low (Chambers 
1987). In other taxa, the background rate may drown out 
the signal of WGDs. It would also be possible to apply this 
method to counts of gene family members or large syntenic 
regions, which would also be expected to double after 
a WGD. Much support for paleopolyploidy has come from 
the identification of repeated syntenic regions (e.g., Holland 
et al. 1994; Abi-Rached et al. 2002; Wong et al. 2002). 

The results presented here represent only the first step in 
the study of molluscan paleopolyploidies, and we hope it 
may help guide future research into molluscan polyploidy 
and in the selection of molluscan taxa for whole genome 
sequencing (fig. 1). Ideally the study of WGDs requires mul- 
tiple genome sequences (Wong et al. 2002; Woods et al. 
2005; Scannell et al. 2007), but currently there exist less 
than five genome sequencing and annotation efforts within 
the Mollusca (Chapman et al. 2007). As costs decrease and 
bioinformatic methodologies continue to rapidly advance, 
the method we outline here may prove useful in selecting 
taxa for the study of molluscan paleopolyploidy In addition, 
researchers interested in paleopolyploidy in non-molluscan 



taxa may also find the methods developed here useful in 
their selection of taxa for whole genome sequencing by 
identifying branches on which the number of chromosomes 
or members of gene families have doubled. 

Supplementary Material 

Supplementary table is available at Genome Biology and 
Evolution online (http://www.gbe.oxfordjournals.org/). 
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